The Structure of a Compiler

1. Lexical Analysis

2. Parsing 

3. Semantic Analysis 

4. Optimization 

5. Code Generation 



Lexical analysis is the first phase in the long series of phases that make up a compiler.

The lexical analyzer scans the input string to isolate its words and identify tokens.

Informally, token = lexeme = word (the basic building block in any language).
Lexical Analysis

What do we want to do? Example:

if (i == j)
    z = 0;
else
    z = 1;

The analyzer sees this piece of code as just a string of characters, including all whitespace characters:

\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

Goal: Partition the input string into substrings, where the substrings are tokens.

A token is a sequence of characters that has a collective meaning.
What's a Token?

The lexical analyzer does not just recognize the lexical units; it also classifies these units according to their role.

A syntactic category 
In English: 

noun, verb, adjective, ... 

In a programming language: 

Identifier, Integer, Keyword, Whitespace, ... 
Tokens

Tokens correspond to sets of strings.



Identifier: strings of letters or digits, starting with a letter 

Integer: a non-empty string of digits 

Keyword: "else" or "if" or "begin" or ... 

Whitespace: a non-empty sequence of blanks, newlines, 
and tabs 
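
The four descriptions above can be turned into membership checks almost word for word. The following is a minimal sketch, not a definitive implementation: the names `KEYWORDS` and `token_class` are hypothetical, the keyword set is only the three keywords the slide names (the slide's list is open-ended), and checking keywords before identifiers is an assumption made here to break the tie.

```python
import string

# Assumed subset: the slide's keyword list ends in "...", so this is illustrative only.
KEYWORDS = {"else", "if", "begin"}

LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)
BLANKS = set(" \n\t")


def token_class(lexeme):
    """Return the token class whose set of strings contains lexeme, or None."""
    if not lexeme:
        return None  # every class above is non-empty
    chars = set(lexeme)
    if lexeme in KEYWORDS:
        return "Keyword"  # assumption: keywords take priority over identifiers
    if chars <= BLANKS:
        return "Whitespace"
    if chars <= DIGITS:
        return "Integer"
    if lexeme[0] in LETTERS and chars <= LETTERS | DIGITS:
        return "Identifier"
    return None


for sample in ["if", "i", "42", " \n\t", "x1", "1x"]:
    print(repr(sample), token_class(sample))
```
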
Lexical Analysis Goal

Classify program substrings (tokens) according to role.

Output of lexical analysis is a stream of tokens, which is input to the parser.

Input string -> Lexical Analysis -> Tokens (sequence of pairs) -> Syntax Analysis



Parser relies on token distinctions 

An identifier is treated differently than a keyword 
Designing a Lexical Analyzer: Step 1

Define a finite set of tokens



Tokens describe all items of interest 

Choice of tokens depends on language, design of parser 
Example

• Recall

\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

Useful tokens for this expression:

Integer, Keyword, Relation, Identifier, Whitespace, (, ), =, ;

Note: (, ), =, ; are tokens, not characters, here.

Some single-character token classes are denoted by the character itself.
Quiz

For the code fragment below, choose the correct number of tokens in each class that appear in the code fragment:

x = 0;\n\twhile (x < 10) {\n\tx++;\n}

○ W = 9; K = 1; I = 3; N = 2; O = 9
○ W = 11; K = 4; I = 0; N = 2; O = 9
○ W = 9; K = 4; I = 0; N = 3; O = 9
○ W = 11; K = 1; I = 3; N = 3; O = 9

W: Whitespace
K: Keyword
I: Identifier
N: Number
O: Other tokens
Designing a Lexical Analyzer: Step 2



Describe which strings belong to each token 
• Recall: 

Identifier: strings of letters or digits, starting with a letter 
Integer: a non-empty string of digits 
Keyword: "else" or "if" or "begin" or ... 

Whitespace: a non-empty sequence of blanks, newlines, 
and tabs 
Lexical Analyzer: Implementation



An implementation must do two things: 

1. Recognize substrings corresponding to tokens 
(lexemes) 

2. Identify the token class of each lexeme. 
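
The two steps can be sketched together with Python's `re` module: one combined pattern recognizes the next lexeme, and the name of the group that matched gives its token class. This is an illustrative sketch under assumptions, not the only way to build a lexer: `TOKEN_SPEC`, `MASTER` and `tokenize` are hypothetical names, only the keywords `if` and `else` are covered, and single-character tokens are labelled by the character itself.

```python
import re

# Order matters: Python tries alternatives left to right, so "==" is listed
# before "=" and keywords are listed before identifiers.
TOKEN_SPEC = [
    ("Whitespace", r"[ \n\t]+"),
    ("Keyword", r"(?:if|else)\b"),
    ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("Integer", r"[0-9]+"),
    ("Relation", r"=="),
    ("Single", r"[()=;]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token class, lexeme) pairs covering all of text."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        lexeme = match.group()
        # Single-character tokens use the character itself as their class.
        token_class = lexeme if match.lastgroup == "Single" else match.lastgroup
        tokens.append((token_class, lexeme))
        pos = match.end()
    return tokens


source = "\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;"
for pair in tokenize(source):
    print(pair)
```

Concatenating the lexemes in order gives back the original string, which is one way to check that the input really was partitioned.
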
Example

Recall:

\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;
Lexical Analyzer: Implementation

The lexer usually discards "uninteresting" tokens that don't contribute to parsing.



Examples: Whitespace, Comments 
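
As a small illustration, discarding can be a filter over the (token class, lexeme) pairs before they reach the parser. The class names and the sample token list below are assumptions made for the sketch.

```python
# Token classes assumed to carry no information for the parser.
DISCARDED_CLASSES = {"Whitespace", "Comment"}


def drop_uninteresting(tokens):
    """Keep only the (token class, lexeme) pairs the parser cares about."""
    return [(cls, lexeme) for cls, lexeme in tokens if cls not in DISCARDED_CLASSES]


# Hypothetical lexer output for: if (i /* test */)
tokens = [
    ("Keyword", "if"),
    ("Whitespace", " "),
    ("(", "("),
    ("Identifier", "i"),
    ("Whitespace", " "),
    ("Comment", "/* test */"),
    (")", ")"),
]
kept = drop_uninteresting(tokens)
print(kept)
```
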
True Crimes of Lexical Analysis

Is it as easy as it sounds?

Not quite!

Look at some history . . .
Lexical Analysis in FORTRAN

FORTRAN rule: Whitespace is insignificant

E.g., VAR1 is the same as VA R1

A terrible design!
Example 1

Consider
DO 5 I = 1,25
DO 5 I = 1.25

This is an example of lexical analysis that requires lookahead: the first line is a DO loop, while the second assigns 1.25 to the variable DO5I, and the two cannot be told apart until the comma or the period is reached.

One of the goals of lexical analysis is to minimize the amount of lookahead, or to bound it by some constant, in order to simplify the implementation of the lexical analyzer.
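
The decision in this example can be sketched as a scan past the `=` sign. This is a deliberately simplified illustration, not how a real FORTRAN compiler works: the function name `classify_do` is hypothetical, and the sketch assumes the statement has no parentheses, strings or continuation lines.

```python
def classify_do(statement):
    """Classify a statement beginning with DO as 'do-loop' or 'assignment'."""
    # FORTRAN rule: whitespace is insignificant, so remove it first.
    text = statement.replace(" ", "")
    equals = text.find("=")
    if not text.startswith("DO") or equals == -1:
        return "other"
    # Lookahead: only what follows '=' tells us what 'DO' meant.
    if "," in text[equals + 1:]:
        return "do-loop"       # e.g. DO 5 I = 1,25
    return "assignment"        # e.g. DO5I = 1.25


print(classify_do("DO 5 I = 1,25"))
print(classify_do("DO 5 I = 1.25"))
```
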
Example 2

Consider

if (i == j)
    z = 0;
else
    z = 1;

Even this simple example has lookahead issues

• i vs. if

• = vs. ==

Footnote: The FORTRAN whitespace rule was motivated by the inaccuracy of punch card operators.
Lexical Analysis in FORTRAN (Cont.)

Two important points:

1. The goal is to partition the string. This is
implemented by reading left-to-right, recognizing
one token at a time

2. "Lookahead" may be required to decide where one 
token ends and the next token begins 
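
For the `=` vs. `==` case, one character of lookahead is enough. The sketch below peeks at the next character without consuming it; `scan_equals` is a hypothetical helper name, and preferring the longer token is an assumption made here.

```python
def scan_equals(text, pos):
    """Scan the token starting at text[pos], which must be '='.

    Returns (lexeme, position of the next token).
    """
    if text[pos] != "=":
        raise ValueError("scan_equals must start at an '=' character")
    # Peek one character ahead to decide where this token ends.
    if pos + 1 < len(text) and text[pos + 1] == "=":
        return "==", pos + 2
    return "=", pos + 1


print(scan_equals("i==j", 1))
print(scan_equals("z = 0", 2))
```
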
Lexical Analysis in PL/I



PL/I keywords are not reserved: you can use a keyword 
both as a keyword and also as a variable name 

IF ELSE THEN THEN = ELSE; ELSE ELSE = THEN 
Lexical Analysis in PL/I (Cont.)

PL/I Declarations:

DECLARE (ARG1, . . ., ARGN)

Can't tell whether DECLARE is a keyword or an array
reference until after the ).

Requires lookahead! Scan beyond this entire argument
list to see what comes next.
Lexical Analysis in C++

Unfortunately, the problems continue today

C++ template syntax:

Foo<Bar>

C++ stream syntax:

cin >> var;

But there is a conflict with nested templates:

Foo<Bar<Bazz>>
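
The conflict is easy to reproduce with a lexer that always takes the longest operator it can. The following sketch is an illustration only: the pattern `TOKEN` and the function `lex` are hypothetical and cover just the handful of symbols used on this slide.

```python
import re

# ">>" and "<<" are listed before "<" and ">" so the longer operator wins.
TOKEN = re.compile(r">>|<<|[<>;]|[A-Za-z_][A-Za-z0-9_]*|\s+")


def lex(text):
    """Split text into lexemes, dropping whitespace."""
    lexemes = []
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r}")
        if not match.group().isspace():
            lexemes.append(match.group())
        pos = match.end()
    return lexemes


print(lex("cin >> var;"))
print(lex("Foo<Bar<Bazz>>"))
```

In the second call the two closing angle brackets come out as a single `>>` lexeme, which is the stream operator rather than two template terminators.
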
Review


The goal of lexical analysis is to 

Partition the input string into lexemes 
Identify the token class of each lexeme 




Left-to-right scan => lookahead sometimes required 
Next



We still need 

A way to describe the lexemes of each token 

A way to resolve ambiguities 

• Is if two variables i and f? 

• Is == two equal signs = =? 
Regular Languages

We must say what set of strings belongs to a token class

Use regular languages

There are several formalisms for specifying tokens



Regular languages are the most popular 
Simple and useful theory 
Easy to understand 
Efficient implementations 
Atomic Regular Expressions

To define regular languages, we must define regular expressions.

Single character

'c' = {"c"}

Epsilon

ε = {""}
Compound Regular Expressions

• Union

A + B = {s | s ∈ A or s ∈ B}

• Concatenation

AB = {ab | a ∈ A and b ∈ B}

• Iteration

A* = ⋃_{i ≥ 0} A^i, where A^i = A ... i times ... A
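
The three compound operations can be tried out on small finite sets of strings. This is a sketch under one stated assumption: A* is infinite whenever A contains a non-empty string, so `star_up_to` (a hypothetical helper, like the other function names here) only builds the union of A^0 through A^n.

```python
from itertools import product


def union(a, b):
    """A + B: strings that are in A or in B."""
    return a | b


def concat(a, b):
    """AB: every string of A followed by every string of B."""
    return {x + y for x, y in product(a, b)}


def power(a, i):
    """A^i: A concatenated with itself i times (A^0 is the set holding the empty string)."""
    result = {""}
    for _ in range(i):
        result = concat(result, a)
    return result


def star_up_to(a, n):
    """Finite approximation of A*: the union of A^0, A^1, ..., A^n."""
    result = set()
    for i in range(n + 1):
        result |= power(a, i)
    return result


print(sorted(union({"a"}, {"b"})))
print(sorted(concat({"a", "b"}, {"c"})))
print(sorted(star_up_to({"a"}, 2)))
```
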
Regular Expressions

Def. The regular expressions R over Σ are the smallest set of
expressions including

ε

'c' where c ∈ Σ

A + B where A, B are rexp over Σ

AB where A, B are rexp over Σ

A* where A is a rexp over Σ

Σ is the given alphabet (the set of characters from which regular
expressions are built).
Regular Expressions Quiz

Choose the regular languages that are equivalent to the given regular language: (0 + 1)*1(0 + 1)*

Σ = {0, 1}

□ (01 + 11)*(0 + 1)*

□ (0 + 1)*(10 + 11 + 1)(0 + 1)*

□ (1 + 0)*1(1 + 0)*

□ (0 + 1)*(0 + 1)(0 + 1)*
Review

Regular expressions specify regular languages.
Regular expressions have five constructs

- Two base cases

• empty and 1-character strings

- Three compound expressions 

• union, concatenation, iteration 

Regular expressions are simple, almost trivial 
But they are useful! 

Reconsider informal token descriptions . . . 
Languages

Def. Let Σ be a set of characters. A language over Σ
is a set of strings of characters drawn from Σ
Examples of Languages

Alphabet = English characters
Language = English sentences

Not every string of English characters is an English sentence

Alphabet = ASCII
Language = C programs

Note: ASCII character set is different from English character set
Syntax vs. Semantics

To be careful, we should distinguish syntax and
semantics.

L(ε) = {""}

L('c') = {"c"}

L(A + B) = L(A) ∪ L(B)

L(AB) = {ab | a ∈ L(A) and b ∈ L(B)}

L(A*) = ⋃_{i ≥ 0} L(A^i)
Syntax vs. Semantics



Meaning function L maps syntax to semantics. A piece of 
syntax is converted to a set of strings. 

Meaning function L maps from expressions into sets. 

Why use a meaning function? 

Makes clear what is syntax, what is semantics. 

Allows us to consider notation as a separate issue 

Because expressions and meanings are not 1-1 

Meaning is many to one: many different syntax pieces 
map to the same meaning. 

Never one to many! 
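
A meaning function can be written down directly as a recursive function from expressions to sets. The sketch below is an assumption-laden illustration: expressions are encoded as nested tuples (an encoding chosen here, not taken from the slides), and because L(A*) is usually infinite the function only returns the strings of length at most `max_len`.

```python
def L(expr, max_len):
    """Strings of length <= max_len in the language denoted by expr.

    expr is one of: ("eps",), ("char", c), ("union", A, B),
    ("concat", A, B), ("star", A).
    """
    kind = expr[0]
    if kind == "eps":
        return {""}
    if kind == "char":
        return {expr[1]} if max_len >= 1 else set()
    if kind == "union":
        return L(expr[1], max_len) | L(expr[2], max_len)
    if kind == "concat":
        left, right = L(expr[1], max_len), L(expr[2], max_len)
        return {a + b for a in left for b in right if len(a + b) <= max_len}
    if kind == "star":
        base = L(expr[1], max_len)
        result = {""}
        frontier = {""}
        while frontier:
            # Extend the newest strings by one more copy of the base language.
            frontier = {a + b for a in frontier for b in base
                        if len(a + b) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown expression kind: {kind}")


zero, one = ("char", "0"), ("char", "1")
# Two different pieces of syntax, one meaning: many to one.
print(L(("union", zero, one), 3) == L(("union", one, zero), 3))
print(sorted(L(("star", zero), 2)))
```

Each expression yields exactly one set, while several expressions can yield the same set, which is the many-to-one property described above.
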
Notation



Languages are sets of strings. 



Need some notation for specifying which sets we want 



The standard notation for regular languages is regular 
expressions. 
Example: Keyword

Keyword: "else" or "if" or "begin" or ...

'else' + 'if' + 'begin' + . . .

Note: 'else' abbreviates 'e''l''s''e'
Example: Integers

Integer: a non-empty string of digits

digit = '0' + '1' + '2' + '3' + '4' + '5' + '6' + '7' + '8' + '9'
integer = digit digit*

Abbreviation: A+ = AA*
Example: Identifier

Identifier: strings of letters or digits, starting
with a letter.

letter = 'A' + ... + 'Z' + 'a' + ... + 'z'
identifier = letter (letter + digit)*

Is (letter* + digit*) the same?
Example: Whitespace

Whitespace: a non-empty sequence of blanks,
newlines, and tabs

(' ' + '\n' + '\t')+
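
The four token definitions translate almost directly into Python's `re` syntax, where union is written `|` instead of `+` and a character class such as `[0-9]` abbreviates a union of single characters. The pattern names below are hypothetical, and the keyword pattern lists only the three keywords the slides name. The last pattern is (letter* + digit*), included so its behaviour can be compared with the identifier pattern.

```python
import re

KEYWORD = re.compile(r"else|if|begin")
INTEGER = re.compile(r"[0-9][0-9]*")              # digit digit*
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")  # letter (letter + digit)*
WHITESPACE = re.compile(r"[ \n\t]+")              # (' ' + '\n' + '\t')+
LETTERS_OR_DIGITS = re.compile(r"[A-Za-z]*|[0-9]*")  # letter* + digit*


def matches(pattern, text):
    """True if the whole of text belongs to the language of pattern."""
    return pattern.fullmatch(text) is not None


for text in ["x1", "123", ""]:
    print(repr(text), matches(IDENTIFIER, text), matches(LETTERS_OR_DIGITS, text))
```
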
Example: Phone Numbers

Regular expressions are all around you!
Consider (650)-723-3232

Σ = digits ∪ {-, (, )}

exchange = digit^3

phone = digit^4

area = digit^3

phone_number = '(' area ')-' exchange '-' phone
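
For comparison, here is how a phone number of the form (650)-723-3232 might be written with Python's `re` module, where `{3}` plays the role of "three repetitions". The names `PHONE_NUMBER` and `is_phone_number` are hypothetical, and the pattern only covers this one layout.

```python
import re

# '(' area ')' '-' exchange '-' phone, with area = exchange = 3 digits, phone = 4 digits
PHONE_NUMBER = re.compile(r"\([0-9]{3}\)-[0-9]{3}-[0-9]{4}")


def is_phone_number(text):
    """True if the whole of text has the form (ddd)-ddd-dddd."""
    return PHONE_NUMBER.fullmatch(text) is not None


print(is_phone_number("(650)-723-3232"))
print(is_phone_number("650-723-3232"))
```
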
Example: Email Addresses

Consider anyone@cs.stanford.edu

Σ = letters ∪ {., @}

name = letter+

address = name '@' name '.' name '.' name
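
The same toy definition can be checked in Python. This sketch mirrors the slide's simplified grammar (letters only, exactly three dot-separated names after the @) and is not a realistic email validator; the names `ADDRESS` and `is_address` are hypothetical.

```python
import re

NAME = r"[A-Za-z]+"  # name = letter+
ADDRESS = re.compile(rf"{NAME}@{NAME}\.{NAME}\.{NAME}")


def is_address(text):
    """True if the whole of text matches name '@' name '.' name '.' name."""
    return ADDRESS.fullmatch(text) is not None


print(is_address("anyone@cs.stanford.edu"))
print(is_address("anyone@stanford.edu"))
```
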
Example: Unsigned Pascal Numbers

digit = '0' + '1' + '2' + '3' + '4' + '5' + '6' + '7' + '8' + '9'

digits = digit+

opt_fraction = ('.' digits) + ε

opt_exponent = ('E' ('+' + '-' + ε) digits) + ε

num = digits opt_fraction opt_exponent
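
The optional parts written as "(...) + ε" correspond to the `?` operator in Python's `re` syntax. The following sketch assumes digits means one or more digits; `NUM` and `is_num` are hypothetical names.

```python
import re

DIGITS = r"[0-9]+"
OPT_FRACTION = rf"(?:\.{DIGITS})?"       # ('.' digits) + epsilon
OPT_EXPONENT = rf"(?:E[+-]?{DIGITS})?"   # ('E' ('+' + '-' + epsilon) digits) + epsilon
NUM = re.compile(DIGITS + OPT_FRACTION + OPT_EXPONENT)


def is_num(text):
    """True if the whole of text is an unsigned number in this grammar."""
    return NUM.fullmatch(text) is not None


for text in ["12", "3.14", "6E-23", "1.5E+10", ".5", "1."]:
    print(text, is_num(text))
```
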
Quiz

Choose the regular languages that are correct
specifications of the English-language description given below:

Twelve-hour times of the form "04:13PM". Minutes should always
be a two-digit number, but hours may be a single digit.

□ (0 + 1)?[0-9]:[0-5][0-9](AM + PM)

□ ((0 + ε)[0-9] + 1[0-2]):[0-5][0-9](AM + PM)

□ (0*[0-9] + 1[0-2]):[0-5][0-9](AM + PM)

□ (0?[0-9] + 1(0 + 1 + 2)):[0-5][0-9](A + P)M






Summary 



Regular expressions describe many useful languages 



Regular languages are a language specification 
We still need an implementation 